feat(loss): add pg_loss aggregation modes by EazyReal · Pull Request #2090 · THUDM/slime

EazyReal · 2026-06-16T08:08:06Z

On main, slime has two built-in pg_loss normalizations: the default sample/rollout mean and the legacy --calculate-per-token-loss path. Prompt-group normalization and fixed-divisor normalization both require a custom pg_loss reducer today. That is fragile because the reducer has to compose correctly with CP slicing, micro-batch packing, and Megatron's final train-step divisor; a wrong constant factor silently changes the effective learning rate.

This PR moves the common pg_loss aggregation choices behind --loss-aggregation. For one train step, let N be the number of rollouts, G = n_samples_per_prompt be the number of rollouts per prompt, P = N / G be the number of prompt groups, M_i be the valid-token count for rollout i, and L = --loss-aggregation-divisor.

| mode | step scalar | behavior |
| --- | --- | --- |
| `sample_mean` | `(1 / N) * sum_i token_mean(i)` | Keeps the current default behavior. |
| `token_mean` | `sum_i,t loss_it * mask_it / sum_i M_i` | Uses the existing `--calculate-per-token-loss` path under the new unified knob. |
| `prompt_mean` | `(1 / P) * sum_g token_mean(prompt_group_g)` | Groups rollouts by `Sample.group_index` and gives each prompt group one unit of weight. |
| `constant` | `sum_i,t loss_it * mask_it / (L * N)` | Uses a fixed token-scale divisor before the usual step average. |
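The four step scalars in the table can be checked on a toy batch. The following is a minimal single-process sketch, not slime's reducer: `step_scalar` and `masked_sum` are hypothetical names, there is no CP slicing or micro-batch packing, and `token_mean(prompt_group_g)` is assumed to pool all valid tokens of the group (suggested by `prompt_mask_sums`, but not confirmed here).

```python
from collections import defaultdict


def masked_sum(losses, mask):
    """Sum of per-token losses over the valid (mask == 1) positions."""
    return sum(l * m for l, m in zip(losses, mask))


def step_scalar(mode, losses, masks, group_index=None, divisor=None):
    """Step scalar for one train step (hypothetical helper, illustration only).

    losses, masks: one list per rollout (loss_it and mask_it in the table).
    group_index:   prompt-group id per rollout, used by prompt_mean only.
    divisor:       the fixed divisor L, used by constant only.
    Assumes every rollout has at least one valid token.
    """
    n = len(losses)
    sums = [masked_sum(ls, ms) for ls, ms in zip(losses, masks)]
    counts = [sum(ms) for ms in masks]  # M_i
    if mode == "sample_mean":
        return sum(s / c for s, c in zip(sums, counts)) / n
    if mode == "token_mean":
        return sum(sums) / sum(counts)
    if mode == "prompt_mean":
        # Assumption: the group token mean pools tokens across the group's rollouts.
        group_sum = defaultdict(float)
        group_count = defaultdict(int)
        for g, s, c in zip(group_index, sums, counts):
            group_sum[g] += s
            group_count[g] += c
        return sum(group_sum[g] / group_count[g] for g in group_sum) / len(group_sum)
    if mode == "constant":
        return sum(sums) / (divisor * n)
    raise ValueError(f"unknown mode: {mode}")


# Toy batch: N = 4 rollouts, G = 2 rollouts per prompt, P = 2 prompt groups.
losses = [[4.0], [1.0, 1.0, 1.0], [2.0], [0.0, 6.0, 9.0]]
masks = [[1], [1, 1, 1], [1], [1, 1, 0]]
group_index = [0, 0, 1, 1]

for mode in ("sample_mean", "token_mean", "prompt_mean"):
    print(mode, step_scalar(mode, losses, masks, group_index=group_index))
print("constant", step_scalar("constant", losses, masks, divisor=4.0))
```

On this batch the four modes give four different scalars, which is the point of the comparison: the choice of denominator changes the magnitude of the step, not only the relative weighting of rollouts.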

The implementation reuses slime's existing reducer path instead of adding a parallel loss stack. Rollout builds prompt_mask_sums only for prompt_mean; cp_utils.get_sum_of_sample_mean still owns CP-aware summation; and the train step keeps its existing final scaling. For prompt_mean, the reducer scales the prompt-group sum by n_samples_per_prompt, so users get the requested prompt mean directly instead of a constant-offset objective that would need learning-rate compensation.
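The effect of the n_samples_per_prompt factor can be written out as a short derivation. It assumes the final train-step divisor equals the rollout count N (the "response-count semantics" described in this PR); the symbol S_prompt for the reducer output is introduced here for illustration only.

```latex
\begin{aligned}
S_{\text{prompt}} &= G \sum_{g=1}^{P} \mathrm{token\_mean}(g) \\
\frac{S_{\text{prompt}}}{N}
  &= \frac{G}{P\,G} \sum_{g=1}^{P} \mathrm{token\_mean}(g)
   = \frac{1}{P} \sum_{g=1}^{P} \mathrm{token\_mean}(g)
\end{aligned}
```

Without the factor G, the same divisor would yield 1/(P G) times the sum, that is, the prompt mean shrunk by a constant factor of G.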

Compared with the other OSS designs checked for this change, this follows slime's distributed constraints rather than copying a central reducer wholesale. verl's actor reducer receives global denominators directly; SkyRL implements the scalar by pre-scaling advantages; prime-rl exposes only the global token mean; and AReaL can reuse its engine-level local scalar + loss weight contract. slime's Megatron/CP path has a different invariant: CP ranks own token shards, while sample, prompt, and constant denominators are whole-step data and the final train-step divisor already has response-count semantics. Keeping the mode selection inside the existing CP-aware reducer preserves those packing and train-step scaling rules.

Startup validation rejects configurations that would silently use the wrong denominator, such as --loss-aggregation-divisor outside constant or --calculate-per-token-loss together with prompt_mean/constant. The custom pg_loss reducer hook remains available and keeps precedence for non-standard objectives.
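The two rejected combinations named above can be sketched as a startup check. This is a hedged illustration rather than the code in slime/utils/arguments.py: `LossArgs`, `validate_loss_aggregation`, and `is_rejected` are hypothetical names, the error messages are invented, and any further checks the PR performs are omitted.

```python
from dataclasses import dataclass
from typing import Optional

MODES = ("sample_mean", "token_mean", "prompt_mean", "constant")


@dataclass
class LossArgs:
    """Hypothetical stand-in for the parsed command-line arguments."""

    loss_aggregation: str = "sample_mean"
    loss_aggregation_divisor: Optional[float] = None
    calculate_per_token_loss: bool = False


def validate_loss_aggregation(args: LossArgs) -> None:
    """Raise ValueError for combinations that would use the wrong denominator."""
    if args.loss_aggregation not in MODES:
        raise ValueError(f"unknown --loss-aggregation mode: {args.loss_aggregation}")
    # A divisor is only meaningful for the constant mode.
    if args.loss_aggregation_divisor is not None and args.loss_aggregation != "constant":
        raise ValueError("--loss-aggregation-divisor requires --loss-aggregation constant")
    # The legacy per-token flag conflicts with prompt_mean and constant.
    if args.calculate_per_token_loss and args.loss_aggregation in ("prompt_mean", "constant"):
        raise ValueError(
            "--calculate-per-token-loss cannot be combined with "
            f"--loss-aggregation {args.loss_aggregation}"
        )


def is_rejected(args: LossArgs) -> bool:
    """Return True if validation refuses the configuration."""
    try:
        validate_loss_aggregation(args)
    except ValueError:
        return True
    return False


print(is_rejected(LossArgs("sample_mean", loss_aggregation_divisor=1024.0)))
print(is_rejected(LossArgs("prompt_mean", calculate_per_token_loss=True)))
print(is_rejected(LossArgs("constant", loss_aggregation_divisor=1024.0)))
```

Failing at startup instead of falling back to a default makes a misconfigured run visible before any gradient step is taken.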

The focused tests cover the final prompt_mean scalar, CP/sample-denominator invariance, argument validation, and rollout metadata construction. The customization guide documents the built-in modes in both English and Chinese.

Tested with:

uv run --no-project --with pytest --with torch --with numpy --with httpx --with pyyaml python -m pytest -q tests/test_cp_utils.py tests/test_megatron_argument_validation.py tests/test_rollout_validation.py
uv run --no-project --with ruff ruff check --config pyproject.toml docs/en/get_started/customization.md slime/backends/megatron_utils/loss.py slime/ray/rollout.py slime/utils/arguments.py tests/test_cp_utils.py tests/test_megatron_argument_validation.py tests/test_rollout_validation.py
uv run --no-project --with black black --check --line-length 119 slime/backends/megatron_utils/loss.py slime/utils/arguments.py

Add first-class pg_loss aggregation modes: sample_mean, prompt_mean, token_mean, and constant. The default remains sample_mean; --calculate-per-token-loss is reconciled as the legacy spelling of token_mean.

This syncs the current THUDM/slime#2090 behavior for prompt_mean: rollout conversion emits prompt_mask_sums, the reducer scales by n_samples_per_prompt, and the final scalar is the direct mean over prompt groups. Startup validation requires global_batch_size to be a multiple of n_samples_per_prompt so prompt groups stay whole within a train step. constant requires --loss-aggregation-divisor and remains incompatible with the per-token path. The custom pg_loss reducer hook keeps precedence, and non-pg metrics keep the default sample-mean reducer.

Validation: uv run --with pytest --with torch --with numpy --with httpx --with pyyaml --with ray --with huggingface_hub --with transformers --with pydantic pytest --confcutdir=tests/fast/backends/training_utils tests/fast/backends/training_utils/test_loss_aggregation.py -q

Validation: uv run --with ruff ruff check miles/backends/training_utils/cp_utils.py miles/backends/training_utils/data.py miles/backends/training_utils/loss.py miles/backends/training_utils/loss_hub/losses.py miles/ray/rollout/train_data_conversion.py miles/utils/arguments.py miles/backends/megatron_utils/model.py miles/backends/experimental/fsdp_utils/actor.py tests/fast/backends/training_utils/test_loss_aggregation.py tests/fast/backends/training_utils/loss/test_loss_snapshot.py

Validation: uv run --with black black --check miles/backends/training_utils/cp_utils.py miles/backends/training_utils/data.py miles/backends/training_utils/loss.py miles/backends/training_utils/loss_hub/losses.py miles/ray/rollout/train_data_conversion.py miles/utils/arguments.py miles/backends/megatron_utils/model.py miles/backends/experimental/fsdp_utils/actor.py tests/fast/backends/training_utils/test_loss_aggregation.py tests/fast/backends/training_utils/loss/test_loss_snapshot.py

EazyReal mentioned this pull request Jun 16, 2026

feat(loss): add --pg-loss-divisor: first-class constant-divisor pg_loss normalization (Dr.GRPO) #2060

Closed

EazyReal force-pushed the upstream-pr/loss-aggregation-modes branch 3 times, most recently from 40d955f to 3774a73 Compare June 17, 2026 08:36

EazyReal marked this pull request as draft June 17, 2026 08:36

EazyReal marked this pull request as ready for review June 17, 2026 08:54

This was referenced Jun 17, 2026

docs: drop dangling Dr.GRPO custom-reducer example reference #2096

Merged

RFC: factor the policy loss into orthogonal axes (advantage × policy-loss × is-level × correction × regularizer) EazyReal/slime#1

Open

EazyReal changed the title ~~Add --loss-aggregation for the four ScaleRL pg_loss aggregation modes~~ feat(loss): add --loss-aggregation for the four ScaleRL pg_loss modes Jun 24, 2026

EazyReal force-pushed the upstream-pr/loss-aggregation-modes branch from 3774a73 to f23073b Compare June 24, 2026 03:18

feat(loss): add --loss-aggregation for the four ScaleRL pg_loss modes

0fe5e98

EazyReal force-pushed the upstream-pr/loss-aggregation-modes branch from f23073b to 0fe5e98 Compare June 24, 2026 04:24

EazyReal changed the title ~~feat(loss): add --loss-aggregation for the four ScaleRL pg_loss modes~~ feat(loss): add pg_loss aggregation modes Jun 26, 2026

fix(loss): normalize prompt_mean by prompt count

fa69ef7

EazyReal force-pushed the upstream-pr/loss-aggregation-modes branch from c3da906 to fa69ef7 Compare June 26, 2026 00:38

docs: add zh loss aggregation guide

9ca9ceb

EazyReal force-pushed the upstream-pr/loss-aggregation-modes branch from 7780bcf to 9ca9ceb Compare June 26, 2026 01:34

EazyReal mentioned this pull request Jun 27, 2026

feat(loss): add --loss-aggregation pg_loss modes radixark/miles#1350

Closed

EazyReal mentioned this pull request Jun 27, 2026

feat(loss): support pg_loss aggregation modes radixark/miles#1498

Open

This was referenced Jun 27, 2026

feat(loss): support pg_loss aggregation modes EazyReal/miles#1

Open

feat(loss): support pg_loss aggregation modes EazyReal/miles#2

Open

Merge remote-tracking branch 'upstream/main' into HEAD

fa29930

EazyReal wants to merge 4 commits into THUDM:main from EazyReal:upstream-pr/loss-aggregation-modes
